Premise engages citizens to crowdsource data in their own communities, thereby illuminating essential local points of interest. Depending on the geographic region, critical facilities may not be easily discoverable with modern search engines or mapping services. The current case study is from a recent campaign in Mexico City, where contributors were asked to find and document pharmacies. Contributor submissions were analyzed in order to best approximate unique pharmacy locations and to understand surrounding features through text extraction. This study should serve as a proof of concept, or prototype, of applied statistical methods for generating insights from campaign-driven crowdsourced data and images.
The project can be divided into two primary objectives: to extract meaning from images and text submitted by contributors, and to harmonize location data from geotags to devise a list of unique pharmacy locations. The desired output is a corpus of texts derived from the images, insights that can be gleaned from them, and a final list of unique pharmacy locations with a corresponding confidence index.
Two forms of data were provided: a directory of 898 images submitted by a total of 233 participants, and a CSV file with the same number of rows, each corresponding to exactly one of the images. The CSV data combined metadata such as GPS location and timestamp with form fields filled in by the participants themselves. The fields solicited information such as how often users visited the pharmacy, whether they were confident in its quality, their opinion about the safety of the neighborhood, and so on. Most importantly, users filled out the name field, which proved critical to identifying specific pharmacies and differentiating between pharmacies with different names that were clustered together.
[1] "X1"
[2] "campaign_id"
[3] "project_id"
[4] "form_id"
[5] "task_id"
[6] "sub_id"
[7] "user_id"
[8] "timestamp"
[9] "photo"
[10] "lat"
[11] "lon"
[12] "campaign_name"
[13] "task_title"
[14] "how safe do you feel in this area?"
[15] "how confident are you in the quality of this pharmacy?"
[16] "have you ever visited this pharmacy?"
[17] "please do not submit hospitals, clinics, or non-pharmacy healthcare entities."
[18] "is this pharmacy currently open?"
[19] "have you ever gotten medication or medical supplies at this pharmacy"
[20] "a pharmacy is a business or vendor where a pharmacist sells prescription and non-prescription medications."
[21] "name"
[22] "how often do you visit this pharmacy?"
[23] "cluster"
Initially, a map with geotagged locations and their associated images as pop-ups was created in Mapbox for exploratory analysis (link). After exploring the map and images, some important observations were made:
FIGURE 1. Unnamed pharmacy
FIGURE 2. Farmacia del Dr. Ahorro
FIGURE 3. Cluster of pharmacies
FIGURE 4. Poor quality photo
To quantify the number of blurred images, I used the OpenCV package in Python to apply the variance of the Laplacian [1], a standard method for blur detection. This revealed that about 8% of all images were blurred, which, as we will see, has a direct impact on the quality of the text that can be extracted.
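As a dependency-free illustration of the idea (the study itself used OpenCV's Laplacian operator; the threshold value below is a placeholder assumption, since a workable cutoff depends on the camera and scene), the blur measure can be sketched as:

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian response over a 2-D list of
    grayscale intensities; low variance suggests a blurred image."""
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1) for x in range(1, w - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

def is_blurred(gray, threshold=100.0):
    # threshold is image- and camera-dependent; 100 is only a placeholder
    return laplacian_variance(gray) < threshold
```

A perfectly flat image scores 0, while sharp edges produce large Laplacian responses and hence high variance.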
Other assumptions
This study assumes that GPS accuracy is within the normal range (~5 m) and that participants did not use GPS spoofing to mask their real location when taking photos.
The methodology consisted of five main components: text extraction, spatial clustering, text clustering, text matching, and building a confidence index. Text detection and extraction were performed in Python using the EAST and Tesseract OCR deep learning models, while spatial clustering was performed with DBSCAN. Text manipulation, matching, and subsequent clustering were performed in R. A custom confidence index was then designed to describe the total strength of the evidence associated with each pharmacy location, along with an explanation of how the statistic behaves as evidence accumulates. The new and old locations were then plotted on a map.
FIGURE 5. Workflow diagram
The first step in text extraction is text detection, or locating text in an image. For this study, EAST (An Efficient and Accurate Scene Text Detector) was used via an existing PyTorch implementation [2]. This is a robust deep learning method for text detection that performs well on unstructured text. First, images were loaded and preprocessed with OpenCV, and a pre-trained model was configured to bound all text identified in the images. The bounding boxes were then used to crop the images into textual fragments, and each fragment was passed to Tesseract, the OCR engine. The modified source code can be found in the author's GitHub repository.
FIGURE 6. Green bounding boxes indicate the text identified by EAST
Once bounding boxes were identified, Tesseract [3], an open-source OCR engine, was configured in Python to recognize text from the cropped bounding boxes of each image. The Spanish language pack was applied for this use case, and each text box was treated as a single line of text:
pytesseract.image_to_string(cropped_image, config='--tessdata-dir tessdata --psm 7', lang="spa")
Spatial clustering was performed first; then, for points within those spatial clusters, text clustering was performed using partitioning around medoids (PAM) on Levenshtein distances.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an unsupervised machine learning algorithm that groups together points that are close to each other based on a distance measure (e.g. Euclidean distance) and a minimum number of points. Its most important feature is that it does not require the number of clusters to be specified a priori; instead, it iteratively joins neighborhoods of radius eps. Three input parameters are required:
eps: two points are considered neighbors if the distance between the two points is below the threshold eps
min_samples: the minimum number of neighbors a given point must have in order to be classified as a core point (clusters have a minimum size of 2)
metric: the metric used when calculating distance between instances (i.e. Euclidean distance)
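A minimal sketch of the algorithm in pure Python may make the three parameters concrete (the study used an off-the-shelf implementation; treating coordinates as planar and using Euclidean distance here is a simplification for illustration):

```python
import math

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # a point counts as its own neighbor, matching common implementations
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise (may later be claimed as a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_samples:
                queue.extend(j_seeds)  # j is a core point: keep expanding
    return labels
```

With min_samples=2, any pair of points within eps of each other forms a cluster, and isolated points are labeled noise.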
For this study, 100 m was selected heuristically as the optimal radius. In our case, it would be worse to underestimate eps than to overestimate it, because a second clustering step (text-based) can still pare down our results if a cluster contains too many pharmacies. If the initial spatial clusters are too small, we risk starting off assuming there are more clusters than there really are. 100 m is an ideal radius because it is very difficult to get a clear photo of a sign or storefront beyond that range using a mobile phone.
It is important to note that DBSCAN only takes us halfway to our objective: it is useful for identifying spatial clusters of photographs, but we know from our observations (see Figure 3) that multiple pharmacies can also be spatially clustered. How do we separate them? After spatial clustering, we need to find any clusters within clusters using pharmacy names. Following the next sections on text matching, this second clustering step is described in the section on PAM (partitioning around medoids).
Text from all fields - OCR output and user input - was cleaned by forcing it to lowercase and by removing excess whitespace, punctuation, and Spanish stopwords [4].
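The cleaning itself was done in R; an equivalent sketch in Python (the stopword list below is a small illustrative subset, not the full Spanish stopword list used in the study):

```python
import re

# illustrative subset of Spanish stopwords; the study used a full list
STOPWORDS = {"de", "del", "la", "el", "los", "las", "y", "en", "un", "una"}

def clean_text(text):
    """Lowercase, strip punctuation and excess whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation to spaces
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)                # split() collapses whitespace
```

For example, `clean_text("Farmacias del Ahorro!")` yields `"farmacias ahorro"`.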
To answer the basic yet important question - what is the name of the pharmacy? - we used Levenshtein distance (LD) [5]. LD is a measure of the similarity between two strings: the distance is an integer representing the number of deletions, insertions, or substitutions (each operation counted as 1) required to transform the source string into the target string. The LD method is often used to correct spelling mistakes, which is why it was chosen as the string distance measure for this study. On one hand, the name input field was prone to human spelling error; on the other, text output from OCR was often incomplete, with missing or misrecognized letters. The LD method was applied to both.
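LD can be computed with the standard dynamic-programming recurrence; a minimal sketch (the study used R's stringdist, shown later, rather than this Python version):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]
```

For instance, `levenshtein("farmcia", "farmacia")` is 1 (one insertion).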
LD was used to detect common misspellings and to differentiate the word farmacia from the spellings of proper pharmacy names. A distance matrix was created for all of the individual words in the name column (entries with multiple words were split and each word extracted). Visual inspection of the distance matrix shows that if the distance is less than or equal to 3, the word farmacia or a close variation of it is present. We can also see that certain words like "farmastar" and "farmacia" have a low distance of 3, yet the former is the proper name of a pharmacy.
stringdist('farmastar', 'farmacia', method='lv')
[1] 3
| name column | LV distance |
|---|---|
| farmacia | 0 |
| farmcia | 1 |
| farmacias | 1 |
| frarmacia | 1 |
| farmamia | 1 |
| farmacity | 2 |
| farmcias | 2 |
| famacias | 2 |
| fatmacias | 2 |
| frmacias | 2 |
| farmaúnica | 3 |
| farma | 3 |
| farmastar | 3 |
| farmasimi | 3 |
| farmaluz | 3 |
| farmafw | 3 |
| farmamigo | 3 |
| farmafe | 3 |
| famcias | 3 |
| farmalyn | 3 |
Since we do not want to remove these proper pharmacy names, as they contain important information, we cannot simply use an arbitrary distance threshold such as 3. One solution is to manually compose a list of common misspellings from the distance matrix. This list was then used to mask the name column and derive the proper names of pharmacies (see below).
[1] "farmacia" "farmcia" "frarmacia" "farmacias" "farmcias" "farmcias"
[7] "farmamia" "frmacias" "famacias" "famcias" "fatmacias"
| original | minus_stop_words | proper_name |
|---|---|---|
| farmacias gi | farmacias gi | gi |
| farmacia guadalajara | farmacia guadalajara | guadalajara |
| farmacia | farmacia | |
| farmacias del ahorro | farmacias ahorro | ahorro |
| farmacia genericos | farmacia genericos | genericos |
Clustering within spatial clusters is the final step in locating pharmacies. DBSCAN was a good start, but it needs to be taken one step further. PAM, or partitioning around medoids, is a classical partitioning technique [6] that clusters a dataset of n objects into k clusters. After creating a Levenshtein distance matrix for each unique spatial cluster, we can apply PAM to cluster pharmacy names within these clusters.
The example below shows how PAM was able to split one 14 point cluster into two - a 12-point and 2-point cluster - based solely on pharmacy name. If pharmacies had blank names (this was only in 3% of submissions) and fell into a spatial cluster with named pharmacies, then they were assumed to belong to that cluster. In the case of multiple pharmacy clusters, a blank name would be assigned to the cluster whose spatial mean was closest to its location. In the case of a cluster of blank names, the final location name is an NA value.
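The idea can be sketched on a precomputed distance matrix (a toy matrix below; in the study it was the Levenshtein matrix of pharmacy names within a spatial cluster). For small clusters an exhaustive search over medoid sets is feasible, so this sketch uses brute force rather than the usual build/swap heuristic:

```python
import itertools

def pam(dist, k):
    """Exhaustive k-medoids on a precomputed distance matrix.
    Returns, for each point, the index of its assigned medoid."""
    n = len(dist)

    def cost(medoids):
        return sum(min(dist[i][m] for m in medoids) for i in range(n))

    best = min(itertools.combinations(range(n), k), key=cost)
    return [min(best, key=lambda m: dist[i][m]) for i in range(n)]

# toy distance matrix: points 0-2 are near each other, as are points 3-4
dist = [
    [0, 1, 1, 10, 10],
    [1, 0, 1, 10, 10],
    [1, 1, 0, 10, 10],
    [10, 10, 10, 0, 1],
    [10, 10, 10, 1, 0],
]
labels = pam(dist, k=2)  # points 0-2 share one medoid, points 3-4 the other
```

Unlike DBSCAN, PAM requires k to be chosen, which in this study corresponds to the number of distinct pharmacy names within a spatial cluster.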
Without ground truth measurements, it is not possible to build a predictive model and identify the most meaningful variables (number of unique user submissions, presence of correctly matching text in a photo, etc.) for locating pharmacies or other points of interest from crowdsourced data. However, one can take a heuristic approach based on the Central Limit Theorem [7] and assume that the higher the sample size (individual contributions), the more likely the sample mean (here, the sample spatial mean) approximates the population mean. Although the population is an abstract concept in this study (it could be represented, for example, by all residents of Mexico City submitting photos of one pharmacy), the concept of a distribution makes sense. As a general rule, sample sizes of 30 or more are sufficient for the Central Limit Theorem to hold. Based on this concept, a simple method can be devised to rate our confidence in location accuracy, which is undoubtedly a function of sample size.
Given a theoretical Pharmacy A, every incremental participant that documents the location of Pharmacy A after the first one is valuable up to a certain extent. Two unique users verifying the location of Pharmacy A is far better than one, and three is significantly better than two. Starting from a score of zero, each incremental user's verification becomes slightly less valuable than the previous one; however, a new user "contributes" much more to the confidence score when the sample size is low. This can be represented mathematically as a limit function, where the score is always between 0 and 1 but never reaches 1:

CI(n) = 1 - 1/sqrt(n)
Thus, a sample size of 1 achieves a score of 0, a sample size of 2 achieves 0.29, 3 achieves 0.42, 4 achieves 0.5, and so on. Of course, this model should be tested against ground truth data, because a better-fitting model may exist (perhaps in reality fewer than 30 samples are needed to generate high confidence, and the model can be replaced with one that converges to 1 faster).
POI matching should not have any bearing on location accuracy. It was apparent from the dataset that photos submitted at highly variable locations often show the exact same object (pharmacy) from the same angle. This means that we should avoid triangulating distances using the image and instead focus on the substance of the text. Is the object in the image actually the point of interest we are looking for? We used OCR to extract text from the images and the LD method to determine (1) whether or not the image text matches farmacia and, where possible, (2) whether the image text matches the name of the pharmacy input by the user.
Each photo can therefore have between 0 and 2 text matches, which should be treated with the same cumulative logic as location accuracy. The more matches, the higher our confidence that a particular point of interest is what participants say it is. Given the difficulty of accurately extracting text from natural scenes, if one submission produces two matches, each match should contribute to n, and the same limit function should be applied to derive a POI score.
indexFun <- function(x) {
  # confidence index: rises quickly for small x, approaches (but never reaches) 1
  1 - 1/sqrt(x)
}
p9 <- ggplot(data.frame(x = c(1, 30)), aes(x = x)) +
stat_function(fun = indexFun, colour = "dodgerblue3", size = 1.5) +
ggtitle("Confidence index of locations and POI type") +
xlab("Unique users / matches") + ylab("Confidence Index CI")
p9
Point-of-interest matching produced positive results, with over 50% of photos containing text approximating farmacia, and over 50% of photos containing text that matched the name field of the submission. However, because the number of unique users submitting photos for each location was relatively low, the location accuracy confidence score was low.
Below, a final map shows the most likely locations of pharmacies based on the data. The number of points was reduced from 898 to 544, a 39% reduction.
pal <- colorFactor(c("navy", "red"), domain = c("new", "old"))

# coordinates of the clustered (new) locations
final_rx$mean_lat <- as.numeric(final_rx$mean_lat)
final_rx$mean_lon <- as.numeric(final_rx$mean_lon)
xy <- final_rx[, c(2, 3)]
spdf <- SpatialPointsDataFrame(coords = xy, data = final_rx,
    proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

# coordinates of the raw (old) submissions
rxy <- raw[, c(6, 5)]
rspdf <- SpatialPointsDataFrame(coords = rxy, data = raw,
    proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))
m <- leaflet() %>% setView(lng = -99.1332, lat = 19.4326, zoom = 15)
m %>% addProviderTiles(providers$CartoDB.Positron) %>%
addMarkers(
data = spdf,
popup = spdf$name
) %>%
addCircleMarkers(
data = rspdf,
popup = rspdf$name,
radius = 3,
color = 'red',
stroke = FALSE, fillOpacity = 0.5
)
What other insights did we get from OCR? The extracted texts also provided information about other services and points of interest in and around the pharmacies. For example, the word medico ('doctor') came up nearly 100 times and consultorio ('office') came up over 50 times. Also frequently present were words like recargas ('phone top-up') and servicio ('service').
Overall, our mean location accuracy and point-of-interest confidence scores were low, at 0.03 and 0.28, respectively. These numbers are less indicative of bad accuracy than of too small a sample size. Low index scores are still informative, but they are a signal that we need to incentivize more participants to make more submissions.
Small sample sizes were the main limitation: the mean number of unique users per final location was only 1.2, which translates into less accurate spatial means for clustered points. Furthermore, even with the latest technology, OCR is difficult to apply successfully to images in the wild, which means it is not advisable to place too much weight on extracted text. In light of this, the current study relied heavily on the names submitted by the participants themselves (the more unique participants submitting the same names at the same locations, the better).
More research should be done into industry grade OCR pipelines for text extraction. Participants should be incentivized to take more photos to increase n, and there should be a designated team member collecting ground truth data to help train future models.